ABSTRACT


Enhancing microdata access is one of the strategic priorities for the Australian Bureau 
of Statistics (ABS) in its transformation program. However, balancing the trade-off be- 
tween enhancing data access and protecting confidentiality is a delicate act. The ABS 
could use synthetic data to make its business microdata more accessible for researchers to 
inform decision making while maintaining confidentiality. This study explores the syn- 
thetic data approach for the release and analysis of business data. Australian businesses 
in some industries are characterised by oligopoly or duopoly. This means the existing mi- 
crodata protection techniques such as information reduction or perturbation may not be 
as effective as for household microdata. The research focuses on addressing the following 
questions: Can a synthetic data approach enhance microdata access for the longitudinal 
business data? What is the utility and protection trade-off using the synthetic data
approach? The study compares confidentialised input and output approaches for protecting
confidentiality and analysing Australian microdata from business survey or administrative
data sources.


Disclaimer: the results of these studies are based, in part, on tax data supplied by the Australian 
Taxation Office (ATO) to the ABS under the Taxation Administration Act 1953, which requires that 
such data is only used for the purpose of administering the Census and Statistics Act 1905. Legislative 
requirements to ensure privacy and secrecy of this data have been adhered to. In accordance with the 
Census and Statistics Act 1905, results have been confidentialised to ensure that they are not likely to 
enable identification of a particular person or organisation. This study uses a strict access control protocol 
and only a current ABS officer has access to the underlying microdata. 

Any findings from this paper are not official statistics and the opinions and conclusions expressed in 
this paper are those of the authors. The ABS takes no responsibility for any omissions or errors in the 
information contained here. Views expressed in this paper are those of the authors and do not necessarily
represent those of the ABS. Where quoted or used, they should be attributed clearly to the authors.

1 INTRODUCTION


Statistical agencies are constantly facing decisions on how to best balance the trade- 
off between protecting data confidentiality and providing greater access to the valuable 
data they collect to inform decision making. Clause 7 of the Statistics Determination 
1983 ensures safe access to ABS data in the form of unidentified individual statistical 
records (microdata), for research and analysis purposes. Clause 7 stipulates that "the
information is disclosed in a manner that is not likely to enable the identification of the
particular person or organization to which it relates" (The Australian Government, 1983).
Protections are important for producing high quality statistics. However, protections have 
to be balanced with appropriate levels of data access and dissemination. As economist 
George Stigler pointed out in 1980, data is both a private and public good. On the one 
hand, statistical agencies must protect confidentiality, but at the same time they also 
need to ensure that data is accessible so that it can be used to inform decisions that have
significant impact on the public interest (Abowd and Schmutte, 2015).


The ABS has increasingly emphasised providing better access to microdata for research. 
The ABS uses the Five Safes Framework to ensure microdata can be used appropriately 
by taking into consideration safe people, projects, settings, data and output (ABS, 2016, 
Desai et al., 2016). The ABS provides three types of microdata products: TableBuilder,
Confidentialised Unit Record Files (CURFs) and detailed microdata (ABS, 2017). The
microdata access methods include both remote and on-site access, depending on the type of
microdata product. Users can analyse highly detailed microdata in the on-site ABS Data
Laboratory (DataLab) environment. For access from the user's own environment, the ABS
provides a suite of microdata products such as TableBuilder (tabulation of Census or
survey data) and Remote Access DataLab (for analysing the more detailed CURFs).
Researchers can also download basic CURFs for analysis in their own environment.
However, these basic CURFs contain little detail and are reported at a more aggregate
level (Tam et al., 2009). These microdata products facilitate research that maximises the 
value of data for informing decisions of importance to Australia. Examples include Healy 
et al. (2015), Breunig and Bakhtiari (2013), Blackmore and Nesbitt (2013). 


The ABS releases CURFs, using suppression, aggregation, and top and bottom coding 
methodologies, to enable analysis of microdata (O’Keefe and Shlomo, 2012). However, 
these techniques can make microdata from business surveys or administrative data sources 
(or business microdata) less useful because some Australian industries are characterised 
by oligopoly or duopoly. Useful information is often suppressed or aggregated to avoid 
re-identification of large businesses. The ABS could consider releasing synthetic datasets 
for researchers to enhance access to business microdata. Synthetic datasets preserve the
relationships between variables so that researchers can make valid inferences about the
target population without accessing the underlying microdata (Loong, 2012). The US
Census Bureau uses synthetic data to make its business microdata more accessible to
researchers and provides a validation service.


This research explores the use of a synthetic data approach as a possible dissemination 
tool for Australian business microdata. The first section describes the statistical models 
and processes to impute missing data. Using imputed data to create synthetic microdata 
provides the advantages of enhancing utility and protection of the synthetic microdata. 
The second section describes the different disclosure control approaches, including
confidentialised input and confidentialised output. The third section provides utility and risk
results for the different approaches. The final section contains conclusions.

2 STATISTICAL MODEL AND DATA ANALYSIS


We are interested in preserving the statistical relationships between the variables in the
firm production function. The statistical model is specified as:

$$\ln y_{jkt} = \alpha_1 \ln L_{jkt} + \alpha_2 \ln K_{jkt} + \alpha_3 \ln M_{jkt} + \alpha_4 \ln \mathit{Firm\_Age}_{jkt} + \tau_{kt} + \epsilon_{jkt}, \qquad (1)$$

where $\ln y_{jkt}$ is the logarithm of total sales adjusted for the repurchase of stocks divided
by the total number of employees for firm $j$ in industry $k$ at time $t$. The logarithm of
estimated firm average labour components $\ln L_{jkt}$ for firm $j$ in industry $k$ at time $t$ is
derived using the method proposed by Abowd et al. (2002). Details can be found in ?.
The logarithm of capital cost $\ln K_{jkt}$ is the logarithm of the sum of equipment depreciation,
business rental expenses and capital investment deductions divided by the total number
of employees for firm $j$ in industry $k$ at time $t$. The logarithm of material costs $\ln M_{jkt}$ is
the logarithm of the inputs used in the production process divided by the total number of
employees for firm $j$ in industry $k$ at time $t$. The logarithm of firm age is $\ln \mathit{Firm\_Age}_{jkt}$
for firm $j$ in industry $k$ at time $t$. We also include time fixed effects $\tau_{kt}$ for industry $k$ at
time $t$ (Breunig and Wong, 2008, Nguyen and Hansell, 2014, Mare et al., 2016). This gives
15 unknown regression parameters in (1). This study used a one percent stratified sample
of business microdata from an expanded prototype dataset ($N > 45000$ firms). Chien
and Mayer (2015) and Chien et al. (2019) provide more details of the prototype dataset. We
simplify notation in (1) by removing the subscripts. We also use a different font, i.e.
$\mathbf{X}$, to represent the observed $N \times 15$ matrix containing all the independent variables in (1).
Similarly, we use $\mathbf{y}$ to represent the observed vector containing the dependent variable in (1).
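As an illustration of how (1) could be estimated, the following minimal sketch fits the model by ordinary least squares with NumPy for a single industry. The function name `fit_production_function`, the explicit intercept and the coding of time effects as year dummies with the first year as baseline are assumptions made for illustration; they are not details confirmed by the study.

```python
import numpy as np

def fit_production_function(ln_y, ln_L, ln_K, ln_M, ln_age, year):
    """OLS fit of (1) for one industry.

    Assumption: an intercept is included and time fixed effects are coded
    as year dummies, with the earliest year as the baseline.
    """
    years = np.unique(year)
    # One dummy column per year except the baseline year.
    dummies = (year[:, None] == years[None, 1:]).astype(float)
    X = np.column_stack([np.ones_like(ln_y), ln_L, ln_K, ln_M, ln_age, dummies])
    alpha, *_ = np.linalg.lstsq(X, ln_y, rcond=None)
    return alpha, X
```

The first five entries of `alpha` are the intercept and the coefficients on labour, capital, materials and firm age; the remaining entries are the year effects.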


The prototype sample contains missing values, particularly for material inputs. Figure 1
shows the missing data pattern; the three variables with missing values are $\ln M$,
$\ln K$ and $\ln y$, in descending order of the proportion missing.

Figure 1: Missing data pattern

Note. The green tile indicates missing data. The blue tile indicates non-missing data. Consider the ABS and
Patents subfigure at the top left: the left panel is a bar chart showing the proportion of missing data for
each variable. The right panel shows the 8 missing data patterns in the data and the proportion of each
pattern.


The missing values in the 1% sample are imputed assuming the data are missing at random
(MAR). The consequence of this assumption is that missing values can be imputed using
models fitted to the observed data (Little and Rubin, 2014). We adopt notation similar
to Reiter (2005a). The experimental dataset consists of $[\mathbf{y}, \mathbf{X}]$, where $\mathbf{y}$ is the $N \times 1$ vector
containing the dependent variable, and $\mathbf{X}$ is the $N \times 15$ matrix containing all the
independent variables in (1). We impute the missing values in $\ln y$, $\ln K$ and $\ln M$
using two Bayesian imputation approaches: Predictive Mean Matching, and Expectation
Maximisation and Bootstrap.


The observed dataset consists of two $N \times 16$ matrices: $D = [\mathbf{y}, \mathbf{X}]$, where $\mathbf{X}$ includes all
the independent variables in (1), and the response indicator matrix $R$, which we use to
partition $D$ into the observed $D_{obs}$ and the missing $D_{mis}$. We use $X^{(y)}$, $X^{(K)}$ and $X^{(M)}$
to denote the matrix for imputing missing data in $\ln y$, $\ln K$ and $\ln M$, respectively. So if
the missing data variable is $\ln y$ then $X^{(y)}$ includes all the independent variables in (1). In
comparison, if the missing data variable is $\ln K$ then $X^{(K)}$ includes all the independent
variables and $\ln y$ but excludes $\ln K$. If the missing data variable is $\ln M$ then $X^{(M)}$ includes
all the independent variables and $\ln y$ but excludes $\ln M$. We impute the missing values
in $\ln y$, $\ln K$ and $\ln M$ separately, using two Bayesian imputation approaches: Predictive
Mean Matching (PMM) and Expectation Maximisation and Bootstrap (EMB).


PMM selects from a set of possible donors from the complete cases whose predictive means
are closest to that of the missing case (Little, 1988). The values of the selected $y_{obs}$ are then
imputed for $y_{mis}$. This method is similar to hot-deck imputation because it randomly
chooses one $y_{imp}$ from the nearest-neighbour complete cases. Vink et al. (2014) show that,
similar to (4), the PMM formula for imputing a target variable $y$ can be expressed as:

$$y = X\beta + \epsilon. \qquad (2)$$

Algorithm 1 describes the concept of the algorithm.


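Algorithm 1 PMM algorithm

1: procedure
2: use $D_{obs}$ to estimate $\hat{\beta}$ and $\hat{\epsilon}$.
3: draw variance $\tilde{\sigma}^2$ from $\hat{\epsilon}'\hat{\epsilon}/A$, where $A$ is $\chi^2$ with $N - k$ degrees of freedom and $k$ is the number of
parameters.
4: draw $\tilde{\beta}$ from a multivariate normal distribution centered at $\hat{\beta}$ with covariance matrix
$\tilde{\sigma}^2 (X_{obs}' X_{obs})^{-1}$.
5: calculate $\hat{y}_{obs} = X_{obs}\hat{\beta}$ and $\hat{y}_{mis} = X_{mis}\tilde{\beta}$.
6: for each $y_{mis}$ do
7: find distance $\Delta_i = |\hat{y}_{obs,i} - \hat{y}_{mis,k}|$ where $i \neq k$.
8: randomly sample one donor from $\Delta_i$ with $i = 1, \dots, 5$ smallest elements and take
the corresponding $y_{obs}$ to impute $y_{mis}$.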
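The steps of Algorithm 1 can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `pmm_impute`, the `n_donors` parameter and the convention that missing values are coded as NaN are hypothetical, and the predictive means use the estimated coefficients for observed cases and the drawn coefficients for missing cases.

```python
import numpy as np

def pmm_impute(X, y, n_donors=5, rng=None):
    """Impute NaN entries of y by predictive mean matching on predictors X.

    Assumption: X is fully observed and missing y values are coded as NaN.
    """
    rng = np.random.default_rng() if rng is None else rng
    obs = ~np.isnan(y)
    X_obs, y_obs, X_mis = X[obs], y[obs], X[~obs]
    n, k = X_obs.shape

    # Step 2: least-squares estimates from the observed cases.
    XtX_inv = np.linalg.inv(X_obs.T @ X_obs)
    beta_hat = XtX_inv @ X_obs.T @ y_obs
    resid = y_obs - X_obs @ beta_hat

    # Step 3: draw the residual variance from e'e / chi-square(n - k).
    sigma2 = resid @ resid / rng.chisquare(n - k)

    # Step 4: draw coefficients from a normal centred at beta_hat.
    beta_draw = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)

    # Step 5: predictive means for observed and missing cases.
    yhat_obs = X_obs @ beta_hat
    yhat_mis = X_mis @ beta_draw

    # Steps 6-8: for each missing case, pick one of the n_donors observed
    # cases with the closest predictive mean and copy its observed value.
    imputed = np.empty(len(yhat_mis))
    for j, target in enumerate(yhat_mis):
        nearest = np.argsort(np.abs(yhat_obs - target))[:n_donors]
        imputed[j] = y_obs[rng.choice(nearest)]

    y_imp = y.copy()
    y_imp[~obs] = imputed
    return y_imp
```

Because every imputed value is copied from a donor, the completed vector contains only values that were actually observed.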


Figure 2 shows how PMM imputes the missing values $y_{mis}$ by randomly selecting one out
of five plausible donors $y_{obs}$ with the smallest distance $\Delta$. The $y_{imp}$ has the smallest $\Delta$ in
this example. PMM has the advantage of imputing real values observed from the data
(Schenker and Taylor, 1996, White et al., 2011, Allison, 2015). PMM also gives more
robust estimates in the presence of misspecification in the imputation model
(Koller-Meinfelder, 2009).

Figure 2 (horizontal axis: $x$)

Note. The plot markers distinguish observed values $y_{obs}$, the imputed value $y_{imp}$, and fitted values $\hat{y}_{obs}$ and
$\hat{y}_{mis}$.
Source: adapted from (Koller-Meinfelder, 2009, p.32)


King et al. (2001) propose EMB, which combines the Expectation Maximisation (EM) algorithm
with bootstrap sampling. Unlike PMM, EMB uses predicted values of a linear
regression fitted to the observed data to impute missing values. EMB assumes variables
in $D$ are multivariate normal and data are missing at random (King et al., 2001). The
imputation formula is

$$\tilde{D}^{mis}_{ij} = D^{obs}_{i,-j}\tilde{\beta} + \tilde{\epsilon}_i, \qquad (3)$$

where the tilde indicates a random draw from the appropriate posterior. The symbol $\tilde{D}^{mis}_{ij}$
denotes an imputed value for row $i$ and column $j$, and $D^{obs}_{i,-j}$ denotes the vector of values
observed of all columns in row $i$ except column $j$. The coefficients $\tilde{\beta}$ can be
calculated from the complete data parameters $\vartheta = (\mu, \Sigma)$, where $\mu$ is the mean vector
and $\Sigma$ is the variance-covariance matrix. The randomness of $\tilde{D}^{mis}_{ij}$ is created by both
estimation uncertainty due to unknown $\vartheta$ and uncertainty in $\tilde{\epsilon}_i$ because $\Sigma$ is not a matrix
of zeros (Honaker and King, 2010).

Algorithm 2 simplifies the notation by removing the superscripts and subscripts for $D_{mis}$
and $D_{obs}$ to describe the concept of the algorithm (Tan et al., 2009).


Algorithm 2 EMB algorithm

1: procedure
2: generate $m$ bootstrap samples of size $n$ with replacement from the posterior
$\Pr(\vartheta)\int \Pr(D \mid \vartheta)\,dD_{mis}$ described in (9b).
3: keep draws of $\vartheta$ with probabilities proportional to the importance ratio, i.e. the ratio of
the posterior to the asymptotic normal approximation evaluated at $\vartheta$. King et al. (2001)
define the importance ratio (IR) without prior as
$$IR = \frac{L(\vartheta \mid D_{obs})}{N(\vartheta \mid \hat{\vartheta}, V(\hat{\vartheta}))}$$
4: draw $\tilde{\beta}$ from a multivariate normal distribution centered at $\hat{\beta}$ with covariance matrix
$\tilde{\sigma}^2 (X_{obs}' X_{obs})^{-1}$.
5: in each sample $m$, fill in $D_{mis}$ by running an EM algorithm described below.
6: Let $\vartheta^{(t)}$ be the current guess of $\vartheta$.
7: Expectation step computes the Q function defined by
$$Q(\vartheta \mid \vartheta^{(t)}) = E[\ell(\vartheta; D_{obs}, D_{mis}) \mid D_{obs}, \vartheta^{(t)}] = \int \ell(\vartheta; D_{obs}, D_{mis})\, f(D_{mis} \mid D_{obs}, \vartheta^{(t)})\, dD_{mis}$$
8: Maximisation step maximises $Q$ with respect to $\vartheta$ to obtain
$$\vartheta^{(t+1)} = \arg\max_{\vartheta} Q(\vartheta \mid \vartheta^{(t)}).$$
9: repeat both Expectation and Maximisation steps
10: until convergence occurs
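The bootstrap and EM steps of Algorithm 2 can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure used in the study: it omits the importance-ratio resampling step, assumes missing cells are coded as NaN and that every row has at least one observed value, and the names `em_mvn`, `emb_impute` and `_conditional` are hypothetical.

```python
import numpy as np

def _conditional(mu, sigma, row, miss):
    """Mean and covariance of the missing cells of `row` given its observed cells."""
    obs = ~miss
    B = np.linalg.solve(sigma[np.ix_(obs, obs)], sigma[np.ix_(obs, miss)])
    mean = mu[miss] + (row[obs] - mu[obs]) @ B
    cov = sigma[np.ix_(miss, miss)] - sigma[np.ix_(miss, obs)] @ B
    return mean, cov

def em_mvn(D, max_iter=500, tol=1e-8):
    """EM estimates (mu, sigma) of a multivariate normal; NaN marks missing cells."""
    n, p = D.shape
    miss = np.isnan(D)
    mu = np.nanmean(D, axis=0)
    sigma = np.diag(np.nanvar(D, axis=0))
    for _ in range(max_iter):
        filled = np.where(miss, 0.0, D)
        extra = np.zeros((p, p))
        # E-step: expected values and conditional covariances of missing cells.
        for i in np.flatnonzero(miss.any(axis=1)):
            mean, cov = _conditional(mu, sigma, D[i], miss[i])
            filled[i, miss[i]] = mean
            extra[np.ix_(miss[i], miss[i])] += cov
        # M-step: update mu and sigma from the expected sufficient statistics.
        new_mu = filled.mean(axis=0)
        centred = filled - new_mu
        new_sigma = (centred.T @ centred + extra) / n
        change = max(np.abs(new_mu - mu).max(), np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if change < tol:
            break
    return mu, sigma

def emb_impute(D, m=5, rng=None):
    """Return m completed copies of D, one per bootstrap sample."""
    rng = np.random.default_rng() if rng is None else rng
    n = D.shape[0]
    miss = np.isnan(D)
    completed = []
    for _ in range(m):
        boot = D[rng.integers(0, n, size=n)]  # resample rows with replacement
        mu, sigma = em_mvn(boot)
        filled = D.copy()
        # Draw each missing block from its conditional normal, as in (3).
        for i in np.flatnonzero(miss.any(axis=1)):
            mean, cov = _conditional(mu, sigma, D[i], miss[i])
            filled[i, miss[i]] = rng.multivariate_normal(mean, cov)
        completed.append(filled)
    return completed
```

Running EM on a bootstrap sample, rather than on the original data, is what carries the estimation uncertainty in the parameters through to the imputed values.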


Baraldi and Enders (2010) discussed how multiple imputation methods create many copies
of datasets with different imputed values. These datasets are analysed using the same
estimation step to generate multiple sets of parameters and normal standard errors. The
final result is derived by using model averaging to incorporate the uncertainty associated
with the model selection process into standard errors and confidence intervals (Schomaker
and Heumann, 2014). It is unclear if model averaging from multiple imputed datasets
provides the best results. This study applies each method 20 times to the 1% sample and
we select the best imputed dataset which maximises the likelihood for (1) from the 40
datasets (Fay, 1992, Meng, 1994).
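The selection rule described above can be illustrated with a short sketch. It assumes that (1) is fitted by ordinary least squares with Gaussian errors, so that the maximised log-likelihood has a closed form; the function names `gaussian_loglik` and `select_best` are hypothetical and the study's own selection code is not shown in the source.

```python
import numpy as np

def gaussian_loglik(X, y):
    """Maximised Gaussian log-likelihood of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    sigma2 = resid @ resid / n  # maximum likelihood estimate of the error variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def select_best(datasets):
    """datasets: list of (X, y) pairs. Returns the index of the dataset with
    the highest log-likelihood, together with all scores."""
    scores = [gaussian_loglik(X, y) for X, y in datasets]
    return int(np.argmax(scores)), scores
```

With 20 imputed datasets from each of the two methods, `datasets` would hold 40 pairs and the returned index identifies the one retained for analysis.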

3 DISCLOSURE CONTROL


O’Keefe and Shlomo (2012) categorise statistical disclosure control methodologies into two 
main approaches - confidentialised input or confidentialised output. Examples of confiden- 
tialised input methods include aggregation, geographical suppression, rounding, swapping 
and adding noise (see Figure 3). However, it is often difficult to quantify the amount of 
information loss or level of protection achieved using confidentialised input approaches. 
Rubin (1993) proposed a method to generate synthetic data by repeatedly sampling from 
a statistical model estimated from actual microdata. The synthetic datasets can be used
for inference while protecting confidentiality.


Figure 3: Confidentialised input approach

Confidentialised output approaches allow data access in a remote analysis system. The
system takes a query and returns the results to the analyst. The analyst does not have
direct access to the microdata. The remote system imposes restrictions on the queries
and applies routines to deliver confidentialised results (see Figure 4).


Figure 4: Confidentialised output approach

The study compares confidentialised input and output approaches using imputed microdata.
The aim is to compare the protection of confidentiality using different approaches.
We explore both synthetic data and perturbation. Reiter (2009) discussed the fully
synthetic and partially synthetic data approaches. Consider the following example: an analyst
wants to estimate a Cobb-Douglas production function from one of the ABS business surveys.
The survey contains 6,500 businesses. The fully synthetic approach randomly simulates values for
business turnover and capital investment for 500 businesses from the joint distributions
of the model. These distributions are estimated using the survey data or other relevant
information. The result is one synthetic dataset. This process is repeated, each time
using a different 500 businesses, to generate multiple synthetic datasets.

In comparison, the partially synthetic (PS) approach only replaces sensitive values in the
microdata with multiple imputations. Consider the same example and assume the capital
investment variable is considered to be sensitive. The statistical agency wants to suppress
any investment value exceeding $20 million. So any businesses in the sample with investment
exceeding the threshold will be excluded from the simulated datasets. The disclosure
protection of PS depends on the nature of the synthesis. Replacing identifying variables with
imputations makes it very unlikely for users to identify the original values; however, it
does not guarantee 100% protection. Synthetic data preserves the underlying statistical
relationships found in the observed data.

3.1 Synthetic data 


This paper explores two synthetic data generation methods for Australian business microdata:
the sequential regression (SR) of Raghunathan et al. (2001) and non-parametric imputation
based on classification and regression trees (CART) proposed by Reiter (2005b).
We use a different font, i.e. $X$, to represent the imputed $N \times 15$ matrix containing all the
independent variables in (1). Similarly, we use $y$ to represent the imputed vector containing
the dependent variable in (1).


We create fully synthetic data for three variables, $\ln y$, $\ln K$ and $\ln M$, from an imputed
experimental dataset (see APPENDIX A). The firm output $\ln y$ is the logarithm of total
sales adjusted for the repurchase of stocks divided by the total number of employees.
Firm capital $\ln K$ is the logarithm of the sum of equipment depreciation, business rental expenses and
capital investment deductions divided by the total number of employees. Material costs $\ln M$
are the logarithm of the inputs used in the production process divided by the total number of employees.
These variables have higher disclosure risks because the business information is more
sensitive. The synthetic variables are combined with the original variables $\ln \mathit{Firm\_Age}$ and
the time indicator variables to estimate (1).


The SR formula for generating synthetic data for $y$ is

$$y = X\beta, \qquad (4)$$

where $X$ is the matrix whose columns contain the observed variables used to predict $y$ and
$\beta$ is the vector of weights given to each of the observed variables used to predict $y$. We apply
(4) three times with $y$ denoting each of the three variables $\ln y$, $\ln K$ and $\ln M$. We use
$X^{(y)}$, $X^{(K)}$ and $X^{(M)}$ to denote the matrix for creating synthetic data in $\ln y$, $\ln K$ and $\ln M$,
respectively. So if the synthetic data variable is $\ln y$ then $X^{(y)}$ includes all the independent
variables in (1) in APPENDIX A. In comparison, if the synthetic data variable is $\ln K$
then $X^{(K)}$ includes all the independent variables and $\ln y$ but excludes $\ln K$. Similarly, if
the synthetic data variable is $\ln M$ then $X^{(M)}$ includes all the independent variables and
$\ln y$ but excludes $\ln M$.

The SR method uses appropriate regression models for different variable types. For example,
continuous variables are generated using a normal model and binary variables
using a logit model. This study only creates synthetic data for continuous variables. The
SR method generates a continuous vector $y^{seq}$ from the parameters directly estimated
from the fitted regression as follows. First, draw a new value $\theta = (\sigma^2, \beta)$ from $\Pr(\theta \mid y)$.
Specifically, the variance is drawn from $\sigma^2 \mid X \sim (y - X\hat{\beta})'(y - X\hat{\beta})\chi^{-2}_{n-k}$, where $n$ is the
total number of observations and $k$ is the dimension of $\beta$. The coefficients are drawn from
$\beta \mid \sigma^2, X \sim N(\hat{\beta}, (X'X)^{-1}\sigma^2)$. Second, the synthetic values for $y^{seq}$ are drawn from the
regression model $y^{seq} \mid \beta, \sigma^2, X \sim N(X\beta, \sigma^2)$. The imputations are generated for each
variable sequentially (Drechsler, 2011).
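The two-stage draw described above can be sketched for a single continuous variable as follows. This is a minimal illustration using NumPy; the function name `sr_synthesise` is hypothetical, and applying it in turn to each of the three variables with the corresponding predictor matrix is left to the caller.

```python
import numpy as np

def sr_synthesise(X, y, rng=None):
    """Draw one synthetic copy of a continuous variable y given predictors X.

    Assumption: a normal linear regression of y on X, as described in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    # Variance draw: residual sum of squares divided by a chi-square(n - k) draw.
    sigma2 = resid @ resid / rng.chisquare(n - k)
    # Coefficient draw: normal centred at beta_hat with covariance (X'X)^-1 sigma2.
    beta = rng.multivariate_normal(beta_hat, XtX_inv * sigma2)
    # Synthetic values: normal with mean X beta and variance sigma2.
    return X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
```

Calling the function repeatedly with different random states yields multiple synthetic copies of the same variable.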


The CART algorithm estimates the conditional distribution of a univariate outcome given 
multivariate predictors by partitioning the predictors into groups with similar outcomes. 
The partitions are created by recursive binary splits of the predictors in a tree structure 
with leaves. The values in each leaf represent the conditional distribution of outcomes 
that satisfy the partitioning criterion. Effectively, CART preserves the underlying re- 
lationships between variables by creating models with many interaction effects (Reiter, 
2005b, Burgette and Reiter, 2010). 


To create $y^{cart}$, we first fit a tree relating $y$ to $X$. We do this separately for all three
variables $\ln y$, $\ln K$ and $\ln M$. The algorithm minimises the deviation of $y$ within each leaf
and stops splitting when the deviation is below 0.001. We label these trees
$tree^{(y),(K),(M)}$. We use $y_{leaf}$ to represent the predicted values of terminal
leaves $leaf^{(y),(K),(M)}$ in the trees. In each leaf of the tree, we use the Bayesian bootstrap
to draw new values from $y_{leaf}$ to create synthetic data (Reiter, 2005b). The Bayesian
bootstrap differs from the standard bootstrap by varying the selection probabilities in the
re-sampling process (Rubin, 1981). The main advantage of using the Bayesian bootstrap
is adding uncertainty in each leaf, because the number of values in each leaf tends to be
small (Reiter, 2005b).
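The Bayesian bootstrap step within each leaf can be sketched as follows. The sketch assumes the tree has already been fitted and that each record carries a leaf label; the array `leaf_id` and the function names `bayesian_bootstrap` and `synthesise_by_leaf` are hypothetical and stand in for whatever tree-fitting routine is used.

```python
import numpy as np

def bayesian_bootstrap(values, size, rng=None):
    """Resample `values` with random selection probabilities (Rubin, 1981)."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values)
    n = len(values)
    # Sort n - 1 uniform draws; the gaps between them sum to one and serve
    # as the selection probabilities for the n values.
    cuts = np.sort(rng.uniform(size=n - 1))
    probs = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    return rng.choice(values, size=size, replace=True, p=probs)

def synthesise_by_leaf(y, leaf_id, rng=None):
    """Replace each value of y by a Bayesian bootstrap draw from its own leaf.

    Assumption: leaf_id[i] is the terminal leaf of record i in a fitted tree.
    """
    rng = np.random.default_rng() if rng is None else rng
    y_syn = np.empty_like(y, dtype=float)
    for leaf in np.unique(leaf_id):
        members = leaf_id == leaf
        y_syn[members] = bayesian_bootstrap(y[members], members.sum(), rng)
    return y_syn
```

In contrast to the standard bootstrap, where each value in a leaf is selected with equal probability, the selection probabilities here are themselves random, which adds variability when a leaf holds only a few values.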


We generate 20 synthetic datasets using each method. We use these datasets
to fit (1) and choose the synthetic dataset with the highest log-likelihood for the
analysis.
3.2 Perturbation 


The confidentialised input approach produces synthetic microdata that allows researchers 
to analyse the microdata. In comparison, the confidentialised output approach, e.g. per- 
turbation, does not allow researchers to access the underlying microdata. Researchers can 
only explore data and perform modelling analyses within a secured remote environment. 
In this environment, on-the-fly routines are applied to confidentialise results for analysis.
These routines protect confidentiality while maximising the utility of the microdata.

The perturbation algorithm starts by considering the estimation for model (1) as solving
$Sc(\alpha; X, y) = 0$, where $Sc(\alpha; X, y) = X'(y - X\alpha)$. The algorithm then adds the noise
$e$ to the score function. We use $\hat{\alpha}^{pert}$ to denote the coefficients after the score function
has been perturbed. The perturbed estimating equation can be expressed as

$$Sc(\hat{\alpha}^{pert}; X, y) = e. \qquad (5)$$

The amount of perturbation is based on a record's contribution to the coefficients in
the estimating equation. The perturbation is added using $e = \sum_i x_i (y_i - x_i'\hat{\alpha}) u_i$, where
$x_i'$ is the $i$-th row of $X$ and the noise $u_i$ is generated independently for each record from the
uniform distribution on $(-1, 1)$.
The estimated coefficients after perturbation are $\hat{\alpha}^{pert} = \hat{\alpha} - (X'X)^{-1}e$, where $\hat{\alpha}$ is the
estimated coefficient using the original microdata; because $u_i$ is symmetric about zero, the sign
of the adjustment does not affect its distribution. The solution of (5), $\hat{\alpha}^{pert}$, is an unbiased
estimate of $\alpha$ because the noise is small and has expected value zero, $E(e) = 0$.
The perturbation has a similar effect to removing records that have large contribution to
the estimated coefficients (Chipperfield and O’Keefe, 2014, Chipperfield, 2014).
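The perturbation of the estimating equation can be sketched as follows. This is an illustration under stated assumptions, not the production routine: the function name `perturbed_ols` is hypothetical, the noise is applied record by record to each record's score contribution, and the perturbed coefficients are obtained by solving (5) exactly, which gives a minus sign in front of the adjustment (the sign is immaterial in distribution because the noise is symmetric about zero).

```python
import numpy as np

def perturbed_ols(X, y, rng=None):
    """Return (alpha_hat, alpha_pert, e) for a score-perturbed OLS fit.

    Assumption: one independent uniform(-1, 1) draw per record scales that
    record's contribution to the score function.
    """
    rng = np.random.default_rng() if rng is None else rng
    XtX_inv = np.linalg.inv(X.T @ X)
    alpha_hat = XtX_inv @ X.T @ y
    resid = y - X @ alpha_hat
    u = rng.uniform(-1.0, 1.0, size=len(y))
    # e = sum_i x_i (y_i - x_i' alpha_hat) u_i
    e = X.T @ (resid * u)
    # Solving X'(y - X a) = e for a gives alpha_hat - (X'X)^-1 e.
    alpha_pert = alpha_hat - XtX_inv @ e
    return alpha_hat, alpha_pert, e
```

Records with large residuals or extreme predictor values contribute most to `e`, so they are the records whose influence on the released coefficients is masked most strongly.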


4 EMPIRICAL RESULTS 


Analysing business microdata can lead to re-identification because it contains large busi- 
ness units, unlike household or person microdata which have a large number of similar 
respondents. This means protecting confidentiality of business microdata can be different 
from protecting household or person microdata. Statistical agencies often take strong 
protection measures to minimise the likelihood of disclosure (O’Keefe and Shlomo, 2012). 
There are different approaches to estimate disclosure risks and one approach is to estimate 
risk scores for individual records using the probability of matching between sample and 
population microdata. These individual record risk scores are then aggregated for the 
entire data file, see (Bethlehem et al., 1990, Shlomo, 2010, Drechsler, 2011). 


The confidentialised output approach does not generate individual confidentialised units
so it is not feasible to calculate the individual risk score. Instead, we follow the approach
of O’Keefe and Shlomo (2012) and show the absolute differences between the identifying
variables in the confidentialised and original microdata across selected industries in
the disclosure risk models. In these models, $\ln y$, $\ln K$ and $\ln M$ are regressed on the
logarithm of the total number of employees in each firm $j$. There are strong positive correlations between
firm size and variables with higher disclosure risks such as $\ln y$, $\ln K$ and $\ln M$. Figures 5
and 7 in APPENDIX C show that the synthetic data approach generally provides more
confidentiality protection as the absolute differences $|\delta|$ are often wider than in the
perturbation approach.


This study compares the estimated coefficients using confidentialised input and output
approaches with the estimated coefficients using original microdata for measuring utility.
Figures 9, 10, 11 and 12 in APPENDIX C compare the estimated coefficients using
different approaches for selected industries. The figures show the results for the main
variables and intercepts. In the large sample size, the sequential regression provides the
best protection but with a higher utility trade-off. The estimated coefficients using the
perturbation approach have relatively smaller biases but the results are comparable. In
comparison, the results are mixed in small sample sizes in industries like mining or public
administration. In general, the standard deviations for the coefficients are larger using the
synthetic data approach. The coefficient plots for the rest of the variables can be found
in APPENDIX C, see Figures 13, 14, 15 and 16.


Figure 17, in APPENDIX D, shows the model residuals using hex-bin plots. Figure 18
shows the quantile-quantile normal plots for all industries. There are no notable differences
when we compare different approaches with the model results using unconfidentialised
data. However, the analysis shows that there are differences when we consider
the mining industry. The confidentialised input approach provides better protection with
a trade-off in variance estimation; see Figure 19 and Figure 20 in APPENDIX D.


5 CONCLUSIONS 


This research compares synthetic data and perturbation approaches for disseminating 
Australian business microdata. The preliminary results show that synthetic data can be 
a possible dissemination tool to make more business microdata accessible while ensuring
confidentiality.


The analysis shows that the confidentialised input approach provides more protection 
than the confidentialised output approach in this particular setting - one percent sample 
file of business microdata. This is partly because the researchers have access to the 
microdata so there is a stronger need to add more noise for protection. The amounts of 
utility loss from synthetic data and perturbation approaches are comparable because the 
estimated coefficients are similar. Synthetic data could be a possible approach for the 
ABS to consider to enhance access to business microdata. This preliminary research has
several areas for possible extension including:

• exploring multilevel models for creating synthetic data to better capture the hierarchical
structure of the dataset (Drechsler, 2015).

• considering other non-parametric approaches for synthetic data such as random forest
or differential privacy (Drechsler and Reiter, 2011).

• exploring synthetic data approaches which also maintain a differential privacy standard
(Sarwate and Chaudhuri, 2013).

A SUMMARY STATISTICS


Table 1: Summary Statistics - Test Data

Statistic              N        Mean   St. Dev.   Min    Max
$\ln Firm\_Age$        47,160   7      5          1      20
$\ln L_{jkt}$          47,160   10     1                 15
$\ln y_{jkt}$          47,160   11     1          1      18
$\ln K_{jkt}$          47,160   8      1          -3     16
$\ln M_{jkt}$          47,160   10     2          -3     18
ABN                    47,160
year                   47,160   2008   3          2002   2013

1 $\ln Firm\_Age$ is the logarithm of firm age. Firm age is derived
as the current year minus the year of incorporation.

2 $\ln L_{jkt}$ is the logarithm of labour inputs.

3 $\ln y_{jkt}$ is the logarithm of per employee value added (i.e. sales
adjusted for repurchase of stock) deflated by industry Gross
Value Added implicit price deflators.

4 $\ln K_{jkt}$ is the logarithm of per employee cost of capital that
includes depreciation, capital rental expenses and capital work
deductions deflated by the industry consumption of fixed capital
implicit price deflators.

5 $\ln M_{jkt}$ is the logarithm of per employee material costs deflated
by Producer Price Indexes Intermediate Goods.

B A BAYESIAN FRAMEWORK FOR IMPUTATION


We assume data are missing at random. The consequence of this assumption is that
missing data can be imputed by fitting a model to the observed data. The complete data
parameters are $\vartheta = (\mu, \Sigma)$, where $\mu$ is the mean vector and $\Sigma$ is the variance-covariance
matrix. The likelihood of these parameters given the observed data can be expressed as

$$\Pr(D_{obs}, R \mid \vartheta) = \int \Pr(D, R \mid \vartheta)\, dD_{mis} \qquad (6a)$$
$$= \int \Pr(D \mid R, \vartheta)\Pr(R \mid \vartheta)\, dD_{mis}. \qquad (6b)$$

Using Bayes’ theorem we can rewrite the first term $\Pr(D \mid R, \vartheta)$ in (6b) as
$\Pr(D \mid \vartheta)\Pr(R \mid D, \vartheta)/\Pr(R \mid \vartheta)$. Substituting the new term into (6b) we have

$$\Pr(D_{obs}, R \mid \vartheta) = \int \Pr(D \mid \vartheta)\Pr(R \mid D, \vartheta)\, dD_{mis}. \qquad (7)$$

Assuming the data are missing at random, the patterns of missing data depend only on
the observed data, so (7) is simplified to

$$\Pr(D_{obs}, R \mid \vartheta) = \int \Pr(D \mid \vartheta)\Pr(R \mid D_{obs}, \vartheta)\, dD_{mis}$$
$$= \int \Pr(D \mid \vartheta)\, dD_{mis}\, \Pr(R \mid D_{obs})$$
$$= \Pr(D_{obs} \mid \vartheta)\Pr(R \mid D_{obs}). \qquad (8)$$

Maximising (6a) over $\vartheta$ is the same as maximising the first term in (8) over $\vartheta$. The
likelihood can therefore be expressed as $L(\vartheta \mid D_{obs}) \propto \Pr(D_{obs} \mid \vartheta)$. Harel and Zhou
(2007) describe the posterior distribution to draw imputations as

$$\Pr(D_{mis} \mid D_{obs}) = \int \Pr(D_{mis} \mid D_{obs}, \vartheta)\Pr(\vartheta \mid D_{obs})\, d\vartheta, \text{ where} \qquad (9a)$$

$$\Pr(\vartheta \mid D_{obs}) \propto \Pr(\vartheta)\int \Pr(D \mid \vartheta)\, dD_{mis} \qquad (9b)$$

is the observed posterior distribution for $\vartheta$ and $\Pr(\vartheta)$ is an uninformative Jeffreys prior
for $\vartheta$.

C EMPIRICAL RESULTS


Figure 5: Disclosure measures - All industries

Figure 6: Disclosure measures - Mining

Figure 7: Disclosure measures - Construction

Figure 8: Disclosure measures - Public administration

Figure 9: Coefficients plots - All industries

Figure 10: Coefficients plots - Mining

Figure 11: Coefficients plots - Construction

Figure 12: Coefficients plots - Public administration
(vertical axis: estimated coefficients; horizontal axis: InK, InM, InL, (Intercept); series: no protection, sequential regression, classification and regression tree, perturbation)

Figure 13: Coefficients plots - All industries

Figure 15: Coefficients plots - Construction


D SELECTED DIAGNOSTICS


Figure 17: Confidentialised residual plots - ALL industries (axis label: Residuals)

Note. Residuals come from fitting (1) to different approaches. The plotting region on these figures is
broken into a mesh of tessellating hexagons, each of which is coloured indicating how many observations
lie in that hexagon.

Figure 18: QQ Norm plots - ALL industries

Note. Residuals come from fitting (1) to different approaches. A 45 degree line indicates that residuals are
normally distributed.

Figure 19: Confidentialised residual plots - mining (axis label: Residuals)

Note. Residuals come from fitting (1) to different approaches. The plotting region on these figures is
broken into a mesh of tessellating hexagons, each of which is coloured indicating how many observations
lie in that hexagon.

Figure 20: QQ Norm plots - mining (panel title: (3) synthetic data - SR)

Note. Residuals come from fitting (1) to different approaches. A 45 degree line indicates that residuals are
normally distributed.